Lecture’s Plan

  1. CNN
  2. Encoder-Decoder
  3. Attention
  4. Transformers

CNN for Text

Convolutional Neural Network (CNN)

  • Convolutional Neural Networks, or Convolutional Networks, or CNNs, or ConvNets
  • For processing data with a grid-like or array topology
    • 1-D grid: time-series data, sensor signal data
    • 2-D grid: image data
    • 3-D grid: video data
  • CNNs include four key ideas related to natural signals:
    • Local connections
    • Shared weights
    • Pooling
    • Use of many layers

CNN Architecture

  • Intuition: Neural network with specialized connectivity structure
    • Stacking multiple layers of feature extractors: low-level layers extract local features, and high-level layers learn global patterns.
  • There are a few distinct types of layers:
    • Convolutional Layer: detecting local features through filters (discrete convolution)
    • Non-linear Layer: element-wise non-linearity, e.g., the Rectified Linear Unit (ReLU)
    • Pooling Layer: merging similar features

Building-blocks for CNNs

(1) Convolutional Layer

  • The core layer of CNNs
  • Convolutional layer consists of a set of filters, \(W_{kl}\)
  • Each filter covers a spatially small portion of the input data, \(Z_{i,j}\)
  • Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map.
  • As we convolve the filter, we are computing the dot product between the parameters of the filter and the input.
  • Deep Learning algorithm: during training, the network corrects its errors and the filters are learned, e.g., in Keras, by adjusting the weights via Stochastic Gradient Descent (SGD), a stochastic approximation of gradient descent that uses a randomly selected subset of the data.
  • The key architectural characteristics of the convolutional layer are local connectivity and shared weights.
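As an illustration of the sliding dot product described above (a sketch in NumPy, not code from the lecture), a 1-D "valid" convolution can be written as:

```python
import numpy as np

# Minimal sketch: a 1-D "valid" convolution, computed as a sliding
# dot product between a small filter and the input signal.
def conv1d(x, w):
    n, k = len(x), len(w)
    # one dot product per position the filter can fully cover
    return np.array([np.dot(x[i:i + k], w) for i in range(n - k + 1)])

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])   # a simple edge-like filter
print(conv1d(signal, kernel))          # -> [-2. -2.]
```

Each output position is the dot product between the filter parameters and the input window under it, exactly as in the bullet above.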

Convolutional Layer: Local Connectivity

  • Neurons in layer m are only connected to 3 adjacent neurons in the m-1 layer.
  • Neurons in layer m+1 have a similar connectivity with the layer below.
  • Each neuron is unresponsive to variations outside of its receptive field with respect to the input.
    • Receptive field: small neuron collections which process portions of the input data.
  • The architecture thus ensures that the learnt feature extractors produce the strongest response to a spatially local input pattern.

Convolutional Layer: Shared Weights

  • We show 3 hidden neurons belonging to the same feature map (the layer right above the input layer).
  • Weights of the same color are shared—constrained to be identical.
  • Replicating neurons in this way allows for features to be detected regardless of their position in the input.
  • Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters being learnt.
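A small sketch of both points above (toy numbers, not from the lecture): a single shared 2-tap filter has only 2 free parameters, and because it is slid over the input, it detects the same pattern wherever it occurs.

```python
import numpy as np

# One shared filter: its 2 weights are reused at every position,
# instead of learning separate weights per position.
w = np.array([1.0, -1.0])

def feature_map(x, w):
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])

a = np.array([5.0, 0.0, 0.0, 0.0])   # "step" pattern at the start
b = np.array([0.0, 0.0, 5.0, 0.0])   # same pattern, shifted right

print(feature_map(a, w))   # strong +5 response at position 0
print(feature_map(b, w))   # the same +5 response, now at position 2
```

The same feature is detected regardless of position, and the layer needs 2 parameters rather than one weight per (input, output) pair.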

Convolution without padding

Convolution with padding

(2) Non-linear Layer

  • Intuition: Increase the nonlinearity of the entire architecture without affecting the receptive fields of the convolution layer
  • A layer of neurons that applies the non-linear activation function, such as,
    • \(f(x)=\max(0,x)\) - Rectified Linear Unit (ReLU);
    fast and most widely used in CNNs
    • \(f(x)=\tanh x\)
    • \(f(x)=|\tanh x|\)
    • \(f(x)=(1+e^{-x})^{-1}\) - sigmoid
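The activation functions listed above can be sketched directly in NumPy (an illustration, not the lecture's code):

```python
import numpy as np

# Element-wise activation functions from the list above.
def relu(x):    return np.maximum(0.0, x)   # f(x) = max(0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))              # negatives clipped to 0
print(np.tanh(x))           # squashed into (-1, 1)
print(np.abs(np.tanh(x)))   # |tanh x|
print(sigmoid(x))           # squashed into (0, 1)
```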

(3) Pooling Layer

  • Intuition: to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting
  • Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value of the features in that region.
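The partition-and-take-the-maximum operation described above can be sketched as follows (a NumPy illustration, not the lecture's code):

```python
import numpy as np

# 2x2 max pooling over non-overlapping windows of a feature map:
# each output entry is the maximum of one sub-region.
def max_pool2d(fm, size=2):
    h, w = fm.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h, size):
        for j in range(0, w, size):
            out[i // size, j // size] = fm[i:i + size, j:j + size].max()
    return out

fm = np.arange(16.0).reshape(4, 4)
print(max_pool2d(fm))   # 4x4 feature map down-sampled to 2x2
```

The spatial size shrinks by the pooling factor, which is exactly how the number of downstream parameters and computations is reduced.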

Pooling (down-sampling)

Other Layers

  • The convolution, non-linear, and pooling layers are typically used as a set. Multiple sets of the above three layers can appear in a CNN design.
    • Input → Conv. → Non-linear → Pooling → Conv. → Non-linear → Pooling → … → Output
  • Recent CNN architectures have 10-20 such layers.
  • After a few sets, the output is typically sent to one or two fully connected layers.
    • A fully connected layer is an ordinary neural network layer, as in other neural networks.
    • Typical activation function is the sigmoid function.
    • Output is typically class (classification) or real number (regression).

Other Layers

  • The final layer of a CNN is determined by the research task.
  • Classification: Softmax Layer \[P(y=j|\boldsymbol{x}) = \frac{e^{w_j \cdot x}}{\sum_{k=1}^K{e^{w_k \cdot x}}}\]
    • The outputs are the probabilities of belonging to each class.
  • Regression: Linear Layer \[f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}\]
    • The output is a real number.
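The two output heads above can be sketched in NumPy (illustrative weights, not from the lecture; each row of \(W\) plays the role of one \(w_j\)):

```python
import numpy as np

# Softmax layer: P(y=j|x) = exp(w_j . x) / sum_k exp(w_k . x)
def softmax_layer(W, x):
    scores = W @ x                      # w_j . x for every class j
    e = np.exp(scores - scores.max())   # shift scores for numerical stability
    return e / e.sum()

# Linear layer: f(x) = w . x, a single real-valued output
def linear_layer(w, x):
    return np.dot(w, x)

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # made-up class weights
x = np.array([2.0, 1.0])
p = softmax_layer(W, x)
print(p, p.sum())                       # class probabilities, summing to 1
print(linear_layer(np.array([0.3, -0.2]), x))
```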

Implementation for text in Python

Convolutional Neural Networks (CNNs)

Main CNN idea for text:

Compute vectors for n-grams and group them afterwards



Example: for “this takes too long”, compute vectors for:

this takes, takes too, too long, this takes too, takes too long, this takes too long
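The n-grams in the example above can be enumerated with a short helper (a sketch, not the lecture's code):

```python
# Enumerate the n-grams (n = 2..4) of the example sentence:
# these are the units a text CNN computes vectors for.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this takes too long".split()
for n in (2, 3, 4):
    print(ngrams(tokens, n))
# -> ['this takes', 'takes too', 'too long']
# -> ['this takes too', 'takes too long']
# -> ['this takes too long']
```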

CNN for text classification

CNN with multiple filters

Python CNN Implementation

Build a CNN in Keras

  • The Sequential model is used to build a linear stack of layers.
  • The following code shows how a typical CNN is built in Keras.
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD

Note:

  • Dense is the fully connected layer;
  • Flatten is used after all CNN layers and before the fully connected layer;
  • Conv2D is the 2D convolution layer;
  • MaxPooling2D is the 2D max pooling layer;
  • SGD is the stochastic gradient descent algorithm.
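SGD, mentioned in the note above, repeatedly adjusts the weights against the gradient computed on a random mini-batch. A minimal self-contained sketch on a toy least-squares problem (toy data, not the lecture's example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # toy inputs
true_w = np.array([2.0, -1.0])     # weights we hope to recover
y = X @ true_w                     # noiseless toy targets

w = np.zeros(2)
lr = 0.1
for step in range(300):
    idx = rng.integers(0, len(X), size=10)      # randomly selected subset
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # gradient of mean squared error
    w -= lr * grad                              # SGD update: w <- w - lr * grad
print(w)   # close to [2, -1]
```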

Encoder-Decoder

Encoder-Decoder

  • RNN: input sequence is transformed into output sequence in a one-to-one fashion.
  • Goal: Develop an architecture capable of generating contextually appropriate, arbitrary length, output sequences
  • Applications:
    • Machine translation,
    • Summarization,
    • Question answering,
    • Dialogue modeling.

Simple recurrent neural network illustrated as a feed-forward network

Most significant change: a new set of weights, U, which connect the hidden layer from the previous time step to the current hidden layer and determine how the network should make use of past context in calculating the output for the current input.

Simple-RNN abstraction

RNN Applications

Sentence Completion using an RNN

  • Trained Neural Language Model can be used to generate novel sequences
  • Or to complete a given sequence (until the end-of-sentence token is generated)

Extending (autoregressive) generation to Machine Translation

  • Translation as Sentence Completion!

(simple) Encoder-Decoder Networks

  • Encoder generates a contextualized representation of the input (last state).
  • Decoder takes that state and autoregressively generates a sequence of outputs.

General Encoder Decoder Networks

Abstracting away from these choices

  1. Encoder: accepts an input sequence, \(x_{1:n}\) and generates a corresponding sequence of contextualized representations, \(h_{1:n}\)
  2. Context vector \(c\): function of \(h_{1:n}\) and conveys the essence of the input to the decoder.
  3. Decoder: accepts \(c\) as input and generates an arbitrary length sequence of hidden states \(h_{1:m}\) from which a corresponding sequence of output states \(y_{1:m}\) can be obtained.

Popular architectural choices: Encoder

Decoder Basic Design

  • produce an output sequence one element at a time

Decoder Design
Enhancement

Decoder: How output y is chosen

  • Sample the softmax distribution (fine for generating novel output; not fine for, e.g., machine translation or summarization)
  • Take the most likely output (doesn’t guarantee that the individual choices make sense together)
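The two choices above can be sketched on a toy next-token distribution (made-up vocabulary and probabilities, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat"]
probs = np.array([0.6, 0.3, 0.1])   # a made-up softmax output

greedy = vocab[int(np.argmax(probs))]              # most likely output
sampled = vocab[rng.choice(len(vocab), p=probs)]   # sample the distribution
print(greedy, sampled)
```

Greedy decoding always picks "the" here; sampling can return any token, with probability given by the distribution.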

Attention

Flexible context: Attention

Context vector \(c\): function of \(h_{1:n}\) and conveys the essence of the input to the decoder.


Flexible?

  • Different for each \(h_i\)
  • Flexibly combining the \(h_j\)

Attention (1): dynamically derived context

  • Replace static context vector with dynamic \(c_i\)
  • derived from the encoder hidden states at each point \(i\) during decoding

Ideas:

  • \(c_i\) should be a linear combination of those states \[c_i = \sum_j{\alpha_{ij}h^e_j}\]
  • \(\alpha_{ij}\) should depend on?

Attention (2): computing \(c_i\)

  • Compute a vector of scores that capture the relevance of each encoder hidden state to the decoder state \(h_{i-1}^d\) \[score(h_{i-1}^d, h_j^e)\]

  • Just the similarity \[score(h_{i-1}^d, h_j^e) = h_{i-1}^d \cdot h_j^e\]

  • Give network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.

\[score(h_{i-1}^d, h_j^e) = h_{i-1}^d W_S h_j^e\]

Attention (3): computing \(c_i\)
From scores to weights

  • Create vector of weights by normalizing scores

\[ \begin{align} \alpha_{ij} &= \text{softmax}(score(h_{i-1}^d, h_j^e))\ \forall j \in e \\ &= \frac{\exp(score(h_{i-1}^d, h_j^e))}{\sum_k{\exp(score(h_{i-1}^d, h_k^e))}} \end{align} \]

  • Goal achieved: compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states.
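The steps above (dot-product scores, softmax weights, weighted average) can be sketched in NumPy with made-up encoder/decoder states:

```python
import numpy as np

# Toy hidden states (made up for illustration).
h_enc = np.array([[1.0, 0.0],    # h_1^e
                  [0.0, 1.0],    # h_2^e
                  [1.0, 1.0]])   # h_3^e
h_dec = np.array([1.0, 0.0])     # h_{i-1}^d, the previous decoder state

scores = h_enc @ h_dec                           # score(h_{i-1}^d, h_j^e) = dot product
alphas = np.exp(scores) / np.exp(scores).sum()   # softmax over j
c_i = alphas @ h_enc                             # weighted average of encoder states
print(alphas, c_i)                               # fixed-length context vector
```

Encoder states similar to the decoder state receive larger weights, so \(c_i\) changes at every decoding step.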

Attention: Summary

Note: Y. Goldberg uses a different notation.

Intro to Encoder-Decoder and Attention (Goldberg’s notation)

Transformers

Transformers (“Attention Is All You Need”, 2017)

High-level architecture

  • Will only look at the ENCODER(s) part in detail

Key property of the Transformer: the word in each position flows through its own path in the encoder.

  • There are dependencies between these paths in the self-attention layer.
  • The feed-forward layer does not have those dependencies => the various paths can be executed in parallel!

Visually clearer on two words

  • dependencies in self-attention layer.
  • No dependencies in Feed-forward layer

Self-Attention

While processing each word, self-attention allows the model to look at other positions in the input sequence for clues to build a better encoding for this word.

Step 1: create three vectors from each of the encoder’s input vectors: a Query, a Key, and a Value vector (typically of smaller dimension), by multiplying the embedding by three matrices that were trained during the training process.

Self-Attention

Step 2: calculate a score (like we have seen for regular attention!) that determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

Take the dot product of the query vector with the key vector of the respective word we’re scoring.

E.g., when processing the self-attention for the word “Thinking” in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

Self Attention

  • Step 3: divide the scores by the square root of the dimension of the key vectors (more stable gradients).
  • Step 4: pass the result through a softmax operation (all scores become positive and add up to 1).

Intuition: softmax score determines how much each word will be expressed at this position.

Self Attention

Step 6: sum up the weighted value vectors. This produces the output of the self-attention layer at this position.


More details:

  • What we have seen for one word is done for all words (using matrices)
  • Need to encode the position of words
  • And improved using a mechanism called “multi-headed” attention

(kind of like multiple filters for CNN)

see https://jalammar.github.io/illustrated-transformer/
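The matrix form of the self-attention steps can be sketched in NumPy (random toy embeddings and made-up projection matrices Wq, Wk, Wv, standing in for the trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))        # 2 words, embedding dim 4 (one row per word)
d_k = 3                            # smaller Query/Key/Value dimension
Wq, Wk, Wv = (rng.normal(size=(4, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # step 1: Query, Key, Value
scores = Q @ K.T / np.sqrt(d_k)                  # steps 2-3: dot products, scaled
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)    # step 4: softmax over each row
Z = weights @ V                                  # steps 5-6: weighted sum of values
print(Z.shape)                                   # one output vector per word
```

Each row of Z is the self-attention output for one word; every row of `weights` is a probability distribution over all input positions.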

The Decoder Side


Summary

Summary: what did we learn?

Time for Practical 7!